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Abstract 


Today’s GPUs (graphics processing units) are used for more than just 
graphics. They are finding new use cases in various areas with high 
demand for parallelism. To address this, both the GPUs and tools for 
their programming are constantly evolving. Our goal is to summarize 
and describe the innovations of modern NVIDIA GPUs and show the 
principles of GPU programming with CUDA and OpenCL. We first 
summarize features, innovations, and properties of NVIDIA GPUs 
of Pascal, Volta, Turning, and Ampere microarchitecture. Then we 
describe and explain the principles of GPU programming, both in 
CUDA and OpenCL. Lastly, we explore the new features and inno- 
vations added in CUDA versions from 8 to 11 and OpenCL 3.0 and 
demonstrate the advantages and new possibilities they introduce. 
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Introduction 


Two main types of programmable processors exist in today’s computer 
systems. The first are CPUs (central processing units), and the second 
are GPUs (graphics processing units). Modern CPUs generally have 
between two and 64 cores and are good for sequential processing. 
However, GPUs have thousands of cores, making them better suited 
than conventional CPUs for an acceleration of software with massive 
parallel execution. GPUs were initially dedicated to 3D graphics. To- 
day they are finding new use cases in areas with high demands on 
parallelism. One of the companies designing GPUs is NVIDIA. With 
the past few GPU generations, NVIDIA has introduced many new 
hardware features such as Tensor Cores used to accelerate Deep Learn- 
ing workloads. They also added new tools for general-purpose GPU 
programming like CUDA Cooperative Groups that extend the pro- 
gramming model for organizing groups of communicating threads. 

In this thesis, we aim to describe the features, properties, and inno- 
vations of recent generations of NVIDIA GPUs (from Pascal to Ampere 
microarchitecture) while highlighting some of their possible use cases. 
Our second goal is to detail the principles and differences of GPU 
programming with CUDA and OpenCL and describe features and 
tools introduced in new CUDA and OpenCL versions that accompany 
presented GPU generations. 

In chapters 1-4, we describe the basics of hardware architecture, 
features, properties, and innovations of NVIDIA GPUs of Pascal, Volta, 
Turing, and Ampere microarchitecture in each chapter, respectively. 
After that, in chapter 5, we introduce motivations behind general- 
purpose GPU programming and its benefits compared to program- 
ming and running programs on a CPU. We also describe the principles 
of GPU programming in CUDA and OpenCL, point out some of their 
differences and similarities, and provide various code examples. In 
the last two chapters, we describe innovations introduced by CUDA 
versions from 8 to 11 and OpenCL 3.0, accompanied by code examples 
demonstrating their use. 

In the beginning, we present innovations and features of modern 
NVIDIA GPUs and point out some of their benefits. After that, we 
detail principles of general-purpose GPU programming in CUDA and 
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OpenCL and provide examples that illustrate the basics as well as the 
use of more advanced tools and features. 


1 Pascal microarchitecture 


Nvidia unveiled the new generation of GPUs equipped with GP10x 
chips in April 2016. New GPUs of Pascal microarchitecture, namely 
Tesla P100 (with GP100 chip), brought performance increase over 
its predecessor and many new features. Such as NVLink, NVIDIA's 
high-performance interconnect that greatly accelerates GPU peer-to- 
peer and GPU-to-CPU communication, and other features improving 
programmability, such as Unified Memory and Compute Preemption. 
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1.1 Hardware architecture 


There is a notable divergence between the high-performance com- 
puting side and its GP100 GPU, and the consumer side GPUs with 
the GP102 GPU equipping the top-of-the-line consumer GPUs, like 
GTX 1080ti. In GP100 GPU, we find an extensive collection of features. 
Some are not present in lower-end GPUs, like faster FP64 and FP16 
operations, NVLink. Shared memory and register file capacity is also 
significantly greater in GP100 GPU than in other Pascal GPUs. [2] 


1.1.1 Data ceter GPUs 


The GP100 GPU features six Graphics Processing Clusters, each with 
10 SMs (Streaming Multiprocessors). GP100 has 4 MB of L2 cache 
and over a 14 MB of register file.[3] GP100's sixth-generation SM 
architecture improves CUDA Core utilization and power efficiency, 
resulting in significant overall GPU performance improvements and 
allowing higher core clock speed compared to previous GPUs. The 
SM incorporates 64 single-precision (FP32) CUDA cores. Each SM is 
partitioned into two processing blocks, each with 32 FP32 CUDA cores, 
an instruction buffer, a warp scheduler, and two dispatch units. Even 
though Maxwell and Kepler SMs had 128 and 192 FP32 CUDA cores, 
respectively, compared to 64 in Pascal, the GP100's SM maintains the 
same register file size and supports similar warp and thread block 
occupancy. GP100's SM has the same number of registers as Maxwell 
GM200 and Kepler GK110 SMs, but the entire GP100 GPU has far 
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more SMs, and thus many more registers overall. This means threads 
across the GPU have access to more registers, and GP100 supports 
more threads, warps, and thread blocks in flight compared to prior 
GPU generations.| 1] 


High Bandwidth Memory 2: 


Much larger problems are being solved by GPUs, requiring much 
larger datasets and higher demand for DRAM bandwidth. To address 
this demand for higher raw bandwidth, Tesla P100, equipped with 
GP100 GPU, uses HBM2 (High Bandwidth Memory 2). HBM2 en- 
ables a significant boost in DRAM bandwidth by changing how the 
DRAMs are packaged and connected to the GPU. Rather than requir- 
ing numerous discrete memory chips surrounding the GPU as in 
traditional GDDR5 GPU board designs, HBM2 includes one or more 
vertical stacks of multiple memory dies. P100 supports a bandwidth 
of 180GB/sec per stack, containing four stacks, a total bandwidth of 
720GB/sec per the whole GPU, which is 2-3 times higher than the 
bandwidth of its predecessors. 1 | 

Another benefit of HBM2 memory is native support for error- 
correcting code (ECC) functionality. ECC provides higher reliability 
for applications that are sensitive to data corruption. ECC technology 
detects and corrects single-bit soft errors before they affect the system. 
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Half-precision arithmetic operations: 


One new capability of GP100 single-precision CUDA cores is the abil- 
ity to perform two FP16 operations using a single paired-operation 
instruction.[1] This is especially useful in deep learning because most 
of the math required to train deep neural networks can be executed in 
a 16-bit half-precision floating-point. Deep neural networks are neural 
networks with three or more layers. These neural networks attempt to 
simulate the behavior of a human brain, allowing it to "learn" from a 
large amount of data. Deep learning drives many AI applications and 
services that improve automation, performing analytical and physical 
tasks without human intervention (image classification, speech recog- 
nition, self-driving, and others).[3, 4] In addition, 16-bit floating-point 
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cuts bandwidth and storage requirements in half. GP100 completes 
the FP16 instruction set with 16-bit atomics, such as atomic ADD, MIN, 
and MAX, allowing lockless cooperation between GPU threads.[3] 
GP100 also changed the double-precision CUDA Cores ratio to 2:1 
from Kepler GK110's 3:1 single-precision to double-precision units.|1 | 


1.1.2 Consumer GPUs 


Consumer side GP102, GP104, and GP106 (from now on GP10X) 
mainly power GPU lineups for gaming PCs and professional worksta- 
tions. The top-of-the-line GP102 chip contains 3MB of L2 cache and 
comprises 6 GPCs (Graphics Processing Clusters) and 30 SMs (Stream- 
ing Multiprocessors) compared to 24 and 16 in Maxwell's GM200 and 
GM204, respectively. Each SM is paired with a PolyMorph Engine 
that handles vertex fetch, tessellation, and various other features used 
mainly in graphics processing. One SM plus one Polymorph Engine 
compose a TCP unit (Texture Processing Cluster). Each SM contains 
128 CUDA cores, a 96KB shared memory unit, and a 48 KB of L1 cache 
storage.[5] 


GDDR5X: 


The GTX 1080ti and GTX 1080 equipped with GP102 GPU and GP104 
GPU contain GDDR5X memory. The GTX 1080’s GDDR5X memory 
operates at 10Gbps. A new GPU circuit architecture and board channel 
design were needed to make these speeds possible. The Pascal GPUs 
also feature an enhanced delta color compression capability that allows 
the GPU to use its available memory bandwidth more efficiently. As a 
result, GTX 1080's memory subsystem effectively has up to 1.7x more 
bandwidth than its direct predecessor, the GTX 980, equipped with 
GM204 GPU. [5] 


Simultaneous Multi-Projection: 


The GP10X Polymorph Engine also includes a new hardware unit 
called Simultaneous Multi-Projection (SMP) unit. The SMP unit gen- 
erates multiple projections of a single geometry stream as it enters 
the SMP engine from upstream shader stages. The SMP engine can 
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process geometry through up to 16 preconfigured projections, sharing 
the center of projection and up to two different projection centers, 
offset along the X-axis. One example application of SMP is optimal 
support for surround displays, with a different projection for each 
display, matching the display angle.[5] 

An important aspect of VR rendering is the requirement that at 
least two projections need to be generated, one for each eye. As in 
the surround display case, this had to be done by rendering to each 
eye separately, resulting in twice the amount of work for the entire 
pipeline. However, the GPU can now render the two stereo projections 
directly in a single render pass because the SMP engine supports two 
separate projection centers. This SMP capability is known as Single 
Pass Stereo.[5] 


1.2 NVLINK 


NVLink is designed as a high-bandwidth GPUSGPU or GPUSCPU 
interface. It supports load /store semantics that let programmers di- 
rectly read or write local graphics memory and peer graphics memory 
or the CPU’s memory using NVLink, all in a common shared ad- 
dress space. NVLink allows clustering of GPUs or GPUs and CPUs, 
which can then be approached as one computing element. NVLink 
uses NVIDIA's High-Speed Signaling interconnect, which transmits 
data over a differential pair running at up to 20 Gb/sec. Eight of these 
differential connections form a Sub-Link that sends data in one di- 
rection and two sub-links, one for each direction, from a Link that 
connects two processors. A single Link supports up to 40 GB/sec of 
bidirectional bandwidth between endpoints. The NVLink implemen- 
tation in GP100 GPU supports up to four Links, enabling maximum 
bidirectional bandwidth of 160 GB/sec.[1, 3] 


1.3 Unified Memory 


The NVIDIA Fermi microarchitecture enabled full support for C and 
C++ pointers by implementing unified GPU address space spanning 
the three main GPU memory spaces. However, this unified address 
space only applies to GPU memory addressing.[ 1] 
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Then CUDA 4 introduced UVA (Unified Virtual Addressing) to 
provide a single virtual memory address space for both CPU and 
GPU memory and enable pointers to be accessed from GPU code 
no matter where in the system they reside. UVA enables Zero-Copy 
memory, a pinned CPU memory accessible by GPU code directly, over 
PCle, without the need for memory copy. This provides some of the 
convince of Unified Memory, but at the cost of worse performance, 
because GPU always accesses it with PCle’s low bandwidth and high 
latency.[1] 

Later, CUDA 6 introduced Unified Memory, which creates a pool 
of managed memory that programs running on the CPU and GPU 
can access without explicit data movement. However, only when CPU 
and GPU processes are not running together because of the limitation 
of the Kepler and Maxwell GPU microarchitecture. Also, the Unified 
Memory address space was limited to the size of the GPU memory.|1, 
3] 

CUDA 8 and Pascal microarchitectures improve Unified Memory 
functionality by adding 49-bit virtual addressing and page faulting 
capability. The larger 49-bit virtual addresses are sufficient to enable 
GPUs to access the entire system memory plus the memory of all GPUs 
in the system. Because of the memory page faulting functionality, the 
CUDA system software does not need to synchronize all managed 
memory allocations to the GPU before each kernel lunch. Instead, 
when a thread running on GPU faults on non-resident memory access, 
it stalls until the page can be migrated and the page table updated. 
Alternatively, the page may be mapped for remote access over PCIe or 
NVLink interconnects.[1, 3, 6] 

These new features of Unified Memory enable oversubscription 
of memory, which means that application running on a GPU can use 
data sets larger than ten their device memory.[1] 

While the Unified Memory model makes GPU programming more 
convenient, it comes at a cost; handling page faults and page migra- 
tions can be expensive. CUDA 8 addresses this issue with features like 
prefetch and memory advice. 
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1.4 Compute Preemption 


Starting with Pascal microarchitecture, NVIDIA GPUs offer fine-grained 
preemption at thread level for mixed graphical/compute tasks; in con- 
trast to Maxwell and older, where graphics and mixed graphic/com- 
pute tasks could only be switched at the boundary of draw calls.|2 | 

Pascal GPUs also allow compute tasks to be interrupted at instruction- 
level granularity, with their context swapped to GPU DRAM. This 
allows other applications to be swapped in and run, followed by the 
original task’s context being swapped back and continuing execution 
where it left off.[1] 

The Fine-grained preemption solves some of the problems of ill- 
behaved applications or long-running applications that monopolize 
the GPU and make graphical tasks unresponsive can now be inter- 
rupted and resumed.|[1] 


2 Volta microarchitecture 


GPUs of Volta microarchitecture aimed mainly at professionals and as 
data centers accelerators were released in 2017. Volta introduces new 
features aimed to accelerate neural network training and inference 
workloads, like Tensor Cores, and improve programmability with 
independent thread scheduling.[7] 


2.1 Hardware Architecture 


GPUs of Volta microarchitecture are equipped with GV100 chips. A 
full GV100 GPU consists of six GPCs (Graphics Processing Clusters), 
each with 14 SMs (Streaming Multiprocessors); 4 more SMs per GPC 
compared to Pascal's GP100, which also had 6 GPCs. GV100 GPU also 
has more L2 cache and register files than GP100 GPU. [7] 

Volta features a new SM architecture that delivers improvements in 
performance and energy efficiency, where it offers 50% higher energy 
efficiency on general compute workloads. One of the major additions 
to SMs are mixed-precision Tensor Cores, used for matrix arithmetic, 
presented in more detail in Chapter 2.2. Similar to Pascal GP100, the 
GV100 SM incorporates 64 FP32 cores and 32 FP64 cores per SM. 
However, the GV100 SM uses a new partitioning method to improve 
SM utilization and overall performance. The GV100 SM is partitioned 
into four processing blocks, in contrast to two in GP100, each with 
16 FP32 Cores, 8 FP64 Cores, 16 INT32 Cores, two mixed-precision 
Tensor Cores, and a new LO instruction cache that provides higher 
efficiency than the instruction buffer used in prior NVIDIA GPUs. 
Even though GV100 SM has the same number of registers as Pascal 
GP100 SM, the entire GV100 GPU has more SMs, and thus many more 
registers overall. Therefore, GV100 supports more threads, warps, and 
thread blocks in flight compared to GP100 GPU.[7] 

In Volta GV100 GPU, L1 data cache and shared memory func- 
tionality are combined into a single memory block, with a combined 
capacity of 128 KB per SM. Volta supports several shared memory/L1 
data cache configurations: 0/128, 8/120, 16/112, 32/96, 64/64, or 96/32 
KB per SM.[8] This gives programmers more flexibility and poten- 
tially more cache capacity, like when they do not use shared memory. 
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Also, using L1 cache instead of shared memory enables programmers 
to get good performance with less programming effort because the 
integration of L1 cache within shared memory block ensures that the 
L1 cache has much lower latency and higher bandwidth than in older 
NVIDIA GPUs. It is important to note that using shared memory is 
still the best choice for maximum performance. Another step to further 
improve performance over prior NVIDIA GPUs is the introduction of 
write-caching (caching of store operation).[7] 

In contrast to Pascal GPUs, Volta GV100 SM includes separate FP32 
and INT32 cores, which allow simultaneous execution of FP32 and 
INT32 operations at full throughput while also increasing instruction 
issue throughput.[7] For example, applications can now in each iter- 
ation of loop interleave pointer arithmetic, like updating addresses, 
with floating-point computation, like processing the current iteration 
at full FP32 throughput. 


2.2 Tensor Cores 


Neural networks are increasingly being used in various areas, such 
as flight control, automotive control, medical systems, and many 
more.[9] This development motivated hardware manufacturers to 
find new solutions that could accelerate deep learning and inference 
workloads, like acceleration of GEMM (General Matrix Multiply) op- 
erations at the core of neural network training.[7] NVIDIA addressed 
this demand with Volta’s Tensor Cores. They are specialized comput- 
ing units capable of performing one matrix-multiply-and-accumulate 
on 4x4 matrices per clock cycle. This means that a Tensor Core can 
perform 64 floating-point Fused-Multiply-Add (FMA) operations in 
one clock cycle. An FMA operation takes input values in half-precision 
while output values can be either half- or full-precision. Tesla V100 
accelerator provides 640 tensor cores with a theoretical peak perfor- 
mance of 125 Tflops/s in mixed precision.| 10 | 

Tensor Cores can be thought of as accelerators within an accelera- 
tor. They add considerable computing performance boost, work on 
low precision, and have their own local memory consisting of frag- 
ments.[10] The Tensor Core programming model also differs from 
the original CUDA programming model. In the CUDA programming 
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model, each thread’s data is private to itself. The only way to access 
data in other threads is to communicate with global/shared memory 
or shuffle instructions explicitly. In contrast, the Tensor Core instruc- 
tion HMMA requires threads within a warp to cooperate, meaning 
it allows different threads to communicate data with each other im- 
plicitly.[11] The programming of Tensor Cores is described in chapter 
6.2.2. 

Regarding performance, Markidis et al.[10]achieved 74% of the- 
oretical performance (83Tflops/s) of Tensor Cores, which is about 
six and three times the performance of GEMM in full and single pre- 
cision, by using cuBLAS GEMM on 8192 x 8192 matrices. They also 
tested CUDA 9 WMMA (warp-level matrix operation) implementa- 
tion, which runs on Tensor Cores, to single and half-precision matrix 
multiplication run on CUDA cores. The implementation utilizing Ten- 
sor Core outperformed CUDA core implementation by a factor of five, 
but only when shared memory was used. Yan et al[11] ) noted in their 
paper after benchmarking memory and Tensor Cores that DRAM and 
L2 cache bandwidth will be a bottleneck for Tensor Cores if they do 
not use big enough blocking size of thread blocks and warps. These 
findings are congruent with the conclusion of Martineau et al.[12] 
that the performance gains provided by Tensor Cores are strongly cor- 
related with the performance of the GPU memory hierarchy. All these 
findings suggest that the focus of optimization is likely to be strategies 
for load/store memory instruction scheduling and data placement in 
GPU memory if we want to achieve the peak performance of Tensor 
Core. 

The possible downside of Tensor Cores is their limited precision 
(matrix multiplication is performed in half-precision). NVIDIA mainly 
advertises their use for deep learning, where the studies on the impact 
of low precision multiplication on the training of deep neural net- 
works and their accuracy suggest that multiplication in half-precision 
(sometimes even lower) with some changes to the training algorithms 
provide comparable accuracy to training with full precision matrix 
multiplication[13, 14, 15]. However, Tensor Core viability in other 
HPC applications is unclear. It is possible to decrease the rounding 
error for the cost of increased computation with various techniques, 
but their success rate and cost vary. Ultimately the impact of mixed- 
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precision matrix-multiply-and-accumulate on various HPC applica- 
tions requires more research. 


2.3 Independent Thread Scheduling 


GPUs of Pascal and earlier microarchitecture use a single program 
counter shared among all 32 threads of a warp, combined with an 
active mask that specifies which threads of the warp are active at any 
given time. Consequently, the execution of threads in divergent execu- 
tion paths is serialized until the point of convergence. This approach 
has a significant drawback; with temporary loss of concurrency, the 
threads in divergent sections cannot signal each other or exchange data. 
The lack of forward -progress for the suspended threads means that 
programmers had to avoid fine-grained synchronization because it 
could lead to deadlocks or use lock-free or warp-aware algorithms.[7 | 
Volta addresses this by maintaining execution state per thread, 
including program counter and call stack, which allows the GPU to 
yield execution of any thread, either to make better use of execution 
resources or to allow one thread to wait for data to be produced by 
another thread. This gives programmers more flexibility and allows 
them to implement fine-grained and starvation-free algorithms. At 
the same time, Volta’s schedule optimizer still tries to group active 
threads from the same warp to maximize parallel efficiency.[7 ] 


12 


3 Turing microarchitecture 


NVIDIA released its new line-up of GPUs with Turing microarchitec- 
ture in the second half of 2018. This generation comes with features, 
such as Real-Time Ray Tracing and shading advancements, mainly 
focused on increased realism and performance of PC games and pro- 
fessional graphic applications. As well as a new generation of Tensor 
Cores with advancements focused on accelerating deep learning in- 
ferencing. 


3.1 Hardware architecture 


The top-of-the-line GPU of Turing microarchitecture, the TU102 GPU, 
consists of six GPCs (Graphics Processing Clusters). Each GPC in- 
cludes a dedicated raster engine and twelve SMs (Streaming Multi- 
processors). The whole TU102 GPU comes with 6 MB of L2 cache.| 16 | 

The new Turing Streaming Multiprocessor is partitioned into four 
processing blocks, each with 16 FP32, 16 INT32 Cores, and in contrast 
to Volta’s 2 FP64 cores for compatibility reasons. The Turing SM incor- 
porates many of the features introduced in Volta microarchitecture, 
like independent thread scheduling, concurrent execution of FP32 and 
INT32 operations, and unified architecture for L1 cache and shared 
memory. While also advancing other like the new generation of Tensor 
Cores, that now offer INT8 and INT4 precision modes for inference 
workloads that can tolerate lower precision. On top of that, each SM 
contains one RT core for ray tracing acceleration.[ 16] 


3.1.1 GDDR6 


As display resolution continues to increase, shader functionality and 
rendering techniques become more complex, memory bandwidth 
and size play a more significant role in GPU performance. Turing 
GPUs utilize GDDR6 memory. On top of 20% increase in bandwidth, 
Turing's GDDR6 memory subsystem delivers 20% power efficiency 
improvements over the GDDR5X memory used in Pascal GPUs.[16 | 
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3.2 RT Cores 


Ray tracing is a method of graphics rendering that simulates the phys- 
ical behavior of light.[17] In the real world, the objects around us are 
illuminated by a light source, and photons can bounce from one ob- 
ject to another before reaching the viewer's eyes. Ray tracing imitates 
those effects by tracing the path of the light rays from the view camera 
(2D viewing surface) out into a 3D model of the scene. When a ray 
encounters objects in the scene, like traveling through or bouncing 
off them, the color and lighting information from all these objects 
contribute to the pixel color and illumination level.[18] This technique 
makes it possible to render physically accurate reflection, refractions, 
shadows, and indirect lighting.[16 | 

The downside of ray tracing is that it requires a lot of computational 
resources. Therefore, it has not been used in real-time applications 
like games, in favor of rasterization and static lightmaps, which are 
faster but can cause a lot of different types of visual artifacts, like light 
leaks or aliasing. The Turing GPUs address this problem by providing 
hard ware-accelerated ray tracing with RT cores. Although, even with 
hardware acceleration, ray tracing is still computationally intensive. 
Therefore the best approach to render scenes in real-time is the use of 
hybrid rendering. Hybrid rendering is a combination of ray tracing 
and rasterization. Rasterization is used where it is most effective, and 
ray tracing is used where it provides the most visual benefit compared 
to rasterization, such as reflections and shadows.[16 |Beyond that, rays 
can be used with even more advanced techniques to generate more 
realistic images, e.g., the path traced global illumination, in which 
rays represent virtual photons in a physical simulation. [19] 

To perform ray tracing, we need to test rays for intersection with 
objects in the scene. To avoid testing every triangle in a scene against 
each ray query, a BVH (bounding volume hierarchy), a tree of axis- 
aligned bounding boxes, is created over the triangles in the scene. 
Then, when a ray probe launches, the BVH tree is traversed, and 
each successive child box is tested against the ray. Finally, the rays 
are tested against the triangles that make up the scene at the leave 
nodes[19]With past GPU generations, this process had to be done by 
shader operations and take up thousands of instruction slots per ray 
cast to test against a bounding box intersection in the BVH until hitting 
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a triangle.[16] On Turing RTX GPUs, a ray quarry is sent from the SM 
(Streaming Multiprocessor) to the RT core. The RT core fetches and 
decodes memory representing the part of the BVH and uses dedicated 
evaluators to test each ray against the box (or triangles at the leave 
nodes). Moreover, the RT cores are faster and more efficient than the 
software emulation of ray tracing, and they alleviate a considerable 
amount of SM’s computational load.[19] 

The known facts about the inner workings of RT cores disclosed by 
NVIDIA are that the SM only has to lunch a ray probe, then the RT core 
does the BVH traversal and ray-triangle tests, and the RT core includes 
two specialized units. The first unit in RT cores does bounding box 
tests, and the second unit does ray-triangle intersection tests. In their 
paper V.V. Sanzharov et. Al. [20] ), tried to examine the inner working 
of the RTX technology. They used various path tracing algorithms 
and conducted several experiments on two different GPUs, Pascal’s 
GTX1070, which only has software implementation of ray tracing, and 
Turing generation’s RTX2070 with hardware-accelerated ray tracing. 
They concluded that the primary mechanism used by RTX include 
arranging random memory access during the ray tracing of diverg- 
ing rays and mechanism for GPU work creation, which includes data 
transfers between different kernels through a cache on a chip. They 
suggest that the RTX technology could be used for solving problems 
other the classical ray tracing. I. Wald et. Al.[21] ),explored the first 
proof-of-concept example of using Turing generation’s RT cores on a 
problem that is not ray tracing. Specifically, for point location in struc- 
tured tetrahedral meshes (i.e., given a point p and tetrahedral mesh m, 
find the tetrahedron t containing p). They tried different approaches 
to reform this problem, such that it can be mapped to RT cores with 
varying degrees of RTX hardware acceleration. Then they compared 
it to reference implementation that does not use RTX technology to 
show that it is possible to use the RT cores for this problem effectively. 
Their attempt to use the RTX cores for something beyond classical ray 
tracing worked and the kernels provided performance improvements 
over the reference implementation. Therefore, proving that RT cores 
can be used for something other than classical ray tracing and raising 
a question for further research; what other three traversal problems 
can be accelerated with RT cores? 
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3.3 Advanced Shading Technologies 


Turing includes several new advanced shading features that improve 
performance, enhance image quality, and enable new levels of geo- 
metric complexity. [16] 


3.3.4 Mesh Shading 


Today's graphic pipeline is very effective at rendering the details of 
a single object but still has limitations. Each object requires its own 
unique draw call from the CPU, and the shader model is a per-thread 
model which limits the types of algorithms that can be used. Mesh 
shading introduces two new shader stages, Task Shader and Mesh 
shader, that support this same functionality but with much more 
flexibility and efficiency. [16] 

This new model makes it possible to support an order of magnitude 
more objects per scene by moving the key performance bottleneck 
of object list processing off the CPU and into a highly parallel GPU 
mesh shading program. For example, if we have a wide field of view 
with hundreds of thousands of individual objects. Instead of sending 
each object to GPU with a unique draw call from the CPU, a developer 
can now send the GPU a list of many objects. The task shader then 
processes this object list in parallel, launches the mesh shader to shade 
corresponding triangles, and submits them to the rasterizer.[16] 


3.3.2 Variable Rate Shading 


VRS (Variable Rate Shading) allows developers to control shading 
rate dynamically, shading as little as once per sixteen pixels or as often 
as eight times per pixel. The application specifies shading rate using 
a combination of shading-rate surface and a per primitive (triangle) 
value. VRS allows developers to shade more efficiently, reducing work 
in screen regions where full resolution shading would not give any 
visible image quality benefit and therefore improving frame rate.| 16] 


3.3.3 Texture-Space Shading 


With texture-space shading, objects are shaded in a private coordinate 
space (a texture space) that is saved to memory, and pixel shaders 
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sample from that space rather than evaluating the result directly. With 
the ability to cache shading results in memory and reuse/resample 
them, developers can eliminate duplicate shading work or use different 
sampling approaches that improve quality.| 16 | 


3.3.4 Multi-View Rendering 


Multi-View Rendering (MVR) extends Pascal’s Single Pass Stereo 
(SPS). While SPS allowed rendering of two views that are the same 
except for an X-axis offset, MVR allows rendering of multiple views 
in a single pass even if the views are based on totally different origin 
positions or view directions.[16] 
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4 Ampere microarchitecture 


The first GPU of Ampere microarchitecture was A100 GPU, NVIDIA's 
eighth-generation data center GPU, released in May 2020. Equipped 
with a GA100 chip, A100 GPU comes with new features and capa- 
bilities, allowing for greater scale-up in data centers using NVLink, 
NVSwithc, and InfinityBand, or scale-out to support multiple inde- 
pendent users or tenants on a single GPU with MIG (Multi-Instance 
GPU).[22] In the fourth quarter of 2020, NVIDIA unveiled its con- 
sumer GPU line-ups equipped with GA102/104/106/107 chips (from 
now on GA10X). The GA10X GPUs (including GA100) build on ad- 
vances and features introduced in previous generations, with third- 
generation Tensor Cores, 2x FP32 processing, and second-generation 
RT cores (not included in GA100 GPU). While also introducing some 
brand new features like L2 cache residency control to manage data 
to keep or evict from the cache. Another notable addition is a new 
asynchronous copy instruction, which directly copies data from global 
memory into shared memory (eliminating the need for usage of in- 
termediate register files); it can also perform the coping in the back- 
ground.[22, 23] 


4.1 Hardware architecture 


4.1.1 Datacenter GPUs 


A full GA100 GPU contains 8 GPCs (Graphics Processing Clusters), 
8 TCPs (Texture Processing Clusters) per GPC, and 2 SMs (Stream- 
ing Multiprocessors) per TPC, which adds to a total of 128 SMs per 
GPU compared to 84 in Volta's GV100 GPU. This new generation of 
data center GPUs introduces a significant increase in L2 cache size; 
the GA100 contains 40 MB, which is nearly seven times more than 
GV100's 6 MB L2 cache. Moreover, GA100's L2 cache, because of its 
new structure, provides two times the L2 cache read bandwidth of 
the Volta's GV100 GPU.[22] 

The GA100 SM introduces new SM architecture, with increased 
performance and some new or enhanced features. A GA100's SM 
contains the same number of FP32, FP64, and INT32 cores (64, 32, 
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64, respectively) as Volta’s GV100. However, the full GA100 GPU 
contains more SMs overall. Each SM contains 192KB of combined 
shared memory and L1 cache (1.5x times more than in GV100's) and 
one third-generation TCs (Tensor Cores) per partition (4 per whole 
SM). Volta’s SM contained eight first-generation TCs, but the TCs in 
Ampere GPUs provide twice the performance of the Volta’s TCs.[22 | 


4.1.2 Consumer GPUs 


Consumer side GA102, GA104, GA106, and GA107 (from now on 
GA10X) mainly power GPU lineups for gaming PCs and professional 
workstations. The top-of-the-line GA102 chip comprises 7 GPCs, 6 
TPCs per GPC, and 2 SMs per TPC, for a total of 84 SMs per GPU, 
compared to 68 in Turing’s TU102. In previous generations of NVIDIA 
GPUs, ROP (Render Output) units have been tied to the memory 
controller and L2 cache. Beginning with GA10X GPUs, ROP units are 
now part of the GPC, each containing two ROP partitions (each with 
8 ROP units); this change boosts the performance of raster operations. 
With its 7 GPCs, the full GA102 GPU consists of 112 ROPs compared 
to 96 ROPs available in TU102 GPU. Furthermore, some of the Ampere 
GPUs (i.e., RTX 3080, RTX 3090) come with new GDDR6X memory, 
which offers 50% more bandwidth than their predecessors.| 23 | 

The GA10X improves on capabilities of previous generations, like 
Turing’s RT and Tensor Cores, and Volta’s and Turing’s concurrent 
FP32 and INT32 operations. Like in Turing GPUs’ SM, the GA10X SM 
is partitioned into four processing blocks, each with 64KB register 
files, LO instruction cache, one warp scheduler, one dispatch unit, and 
a set of math and other units (16 FP32 and 16 INT32/FP32 cores). 
Unlike in SMs of Turing microarchitecture, which included one first- 
generation RT core and two second-generation TCs (Tensor Cores), the 
GA10X SM contains one second-generation RT core and only one third- 
generation TC. However, it is twice as powerful as its predecessor. The 
combined L1 data cache/shared memory subsystem size has increased 
by 33% compared to Turing’s, more specifically from 96 KB to 128 
KB. This means that in graphics workloads and async compute, the 
GA10X will allocate 64 KB L1 data/texture cache ( double the 32KB 
in Turing), 48 KB shared memory, and 16 KB reserved for various 
graphics pipeline operations. As in Turing, when in compute mode, 
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the GA10X SM supports various configurations of L1 data cache and 
shared memory.[22, 24] 


RT Cores: 


GA10X SM can process two compute (RT core and compute) work- 
loads simultaneously and is not limited to just compute and graphics 
(RT core and graphics) workloads simultaneously as in prior GPU 
generations. Furthermore, Ampere's new second-generation RT cores 
include several enhancements, those combined with improvements to 
caching subsystem, increase the RT core performance by a factor of 
two over the RT core in Turing GPUs.[23] NVIDIA has also added a 
new component to the second-generation RT cores for interpolating 
the triangle position, which accelerates ray-traced motion blur.[23, 25] 
Motion blur can occur when the camera is moving and looking across 
various static objects in the scene or when the geometry is moving in 
front of the static /moving camera. GPUs of prior generations could 
accelerate moving camera type of motion blur quite well. However, 
performing motion blur on moving geometry can be quite challenging. 
The new Interpolate Triangle Position unit can generate triangles in 
the BVH (bounding volume hierarchy) tree in between existing trian- 
gle representations based on object motion. Thanks to this, rays can 
intersect triangles at their expected positions at specified times. This 
new unit allows accurate ray-traced motion blur rendering to occur 
up to eight times faster than the Turing GPUs.[23] 


4.2 2xFP32 processing 


In Turing, each of the four SM’s partitions has two primary data paths. 
One for FP32 and the other for INT32 operations. Ampere’s GA10X 
GPUs also include two data paths in each partition. One consists of 
16 FP32 CUDA Cores capable of 16 FP32 operations per clock. The 
second one consists of both 16 FP32 and 16 INT32 CUDA cores and 
can execute either 16FP32 or 16 INT32 operations per clock. Together, 
all four partitions can execute 128 FP32 operations per clock, which 
is double the FP32 rate of Turing SM (which could only execute 64 
FP32 operations alongside 64 INT32 operations per clock).[22, 23] 
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The theoretical peak performance of the FP32 rate is double that of 
Turing. However, the actual speedup depends on the workload, more 
specifically on the ratio of FP32 instructions to INT32 instructions; the 
theoretical 2x speedup is only achievable in workloads consisting of 
only FP32 instructions. 


4.3 Third-generation Tensor Cores 


TCs (Tensor cores) were first introduced in Volta GPUs to perform 
tensor/matrix operations. Later, an improved second version of TCs 
was released with the Turing generation’s GPUs, and now Ampere’s 
GA10X incorporates the third-generation of Tensor cores with even 
more features and improvements. In each GA10X’s SM partition is 
only one TC, in contrast to two in Turing and Volta, but this decrease 
is balanced by an increase in the number of SMs per GPU. More im- 
portantly, a third-generation TC can perform 256 in GA100 GPU and 
128 in other GA10X GPUs FP16/FP32 FMA operations per clock cycle, 
four or two times the amount compared to second-generation TC, and 
same or twice the computational power per SM.[23] If we compare 
not just per SM performance but total GPU performance, we find that 
Ampere’s A100 GPU has 2.5x higher mixed-precision TC performance 
than its predecessor Volta’s V100 GPU.[22] With other Ampere GPUs, 
the peak performance of standard mixed-precision TC operation is 
approximately the same or slightly better than their Turing’s coun- 
terparts. This difference mostly boils down to the difference in the 
amount of SMs, because the overall TC performance in mixed preci- 
sion FMA operation per SM is approximately the same in Ampere’s 
GA10X GPUs (excluding GA100) and Turing’s TU10X GPUs. 

In addition to FP16 precision supported in Volta’s TCs, and later ex- 
panded with the INT8, INT4, and binary 1-bit precision in Turing’s TCs, 
Ampere’s Tensor Cores add support for two new data types, BF16 and 
TF32, supported in all GA10X GPUs, and third, FP64, which is exclu- 
sive for GA100 GPU. BF16 is an alternative to IEEE FP16 and includes 
an 8-bit exponent, 7-bit mantissa, and 1-bit sign. NVIDIA states that 
BF16 is viable for NN (neural network) training in mixed-precision 
mode.| 22, 23] This is congruent with finding of Dhiraj Kalamkar, et. 
al.[26], who in their paper tested BF16 format in various DL (deep 
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learning) models and concluded that BF16 is a viable alternative to 
half-precision format for DL training and even has a few advantages 
over FP16. The mixed precision TC mode with BF16/FP32 delivers the 
same performance as FP16/FP32 mode. The second newly supported 
data type is TF32, allowing developers to directly use FP32 data with 
TCs. TCs read FP32 data and use the same range as FP32 with reduced 
internal precision before producing a standard FP32 output (TF32 add, 
FP32 accumulate). TF consists of an 8-bit exponent (same as FP32), 10- 
bit mantissa (same precision as FP16), and one sign-bit.[22] However, 
the use of TF32 comes with performance penalization. Compared to 
the use of mixed precision with FP16, the peak performance of TC 
is halved. Many applications in HPC, namely from areas of scientific 
and research disciplines, rely on double precision (FP64) computa- 
tion. To help with the acceleration of these workloads, A100 GPU’s 
(which is powered by GA100) TC add support for IEEE-compliant 
FP64 computations, delivering up to 2.5x the FP64 performance of its 
predecessor Volta’s V100 GPU.[22] 

As the computational power needed to process neural networks 
increases, the demand for more efficient models and computation rises 
too. Neural network pruning (removing unnecessary model parame- 
ters yielding a so-called sparse network) reduces model complexity 
while maintaining accuracy. To exploit network pruning, Ampere’s 
GPUs introduce the concept of FGSS (fine-grained structured spar- 
sity ).[27] FGSS imposes a constraint on the allowed sparsity pattern, 
making it more efficient for hardware to do the necessary alignment of 
input operands. The enforced structure is a 2:4 Sparse matrix, meaning 
that at least two elements must be zero out of every four elements. 
Due to the well-defined structure of the matrix, it can be compressed 
efficiently and reduce memory storage and bandwidth by almost 2x. 
Furthermore, Ampere GPUs introduce a new Sparse Matrix Multiply- 
Accumulate instruction, which skips the compute on entries with zero 
values, doubling the TC compute throughput.|22 | 


4.4 Multi-Instance GPU 


Some acceleration tasks are not demanding, such as early-stage de- 
velopment or inference on simple and small models.[22] In such a 
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case dedicating the whole GPU to one application would be waste- 
ful. This problem was first addressed using time-slicing and later in 
Kepler microarchitecture with MPS (Multi-Process Service), which 
allowed multiple CPU processes to be combined into a single applica- 
tion context and run on the GPU. Later Volta advanced MPS by adding 
hardware acceleration of critical components.[7] However, with MPS, 
memory system resources were shared across all the applications. 
Therefore, one application could interfere with others if it had a high 
demand for DRAM bandwidth or its requests oversubscribed the L2 
cache. Volta’s MPS, which remains fully supported on Ampere, was 
designed for sharing the GPU among apps from a single user, but not 
for multi-server or multi-tenant use cases.[22 | 

NVIDIA comes up with a new feature, MIG (Multi-Instance GPU), 
available in Ampere’s data center accelerators with GA100 GPU. The 
MIG feature can partition each GPU into two to seven GPU instances. 
Each GPU instance behaves like a smaller, fully capable independent 
GPU that includes a predefined number of GPCs (Graphics Processing 
Clusters), SMs (Streaming Multiprocessors), L2 cache slices, memory 
controllers, and frame buffer memory. Each instance’s SMs have sepa- 
rate and isolated paths through the entire memory system; L2 cache 
banks, memory controllers, and DRAM address busses are aligned 
uniquely to an individual instance. This ensures (in contrast with 
MPS) that each user’s workload can run with predictable and stable 
throughput and latency, with the same L2 cache allocation and DRAM 
bandwidth even if other tasks are trashing their caches or saturating 
their DRAM interface.[22] 


4.5 RTXIO 


The size of game data, CAD models, scientific visualizations, and 
other datasets that need to be loaded from drives to GPU memory 
is ever-increasing. Some of these datasets (i.e., game data) even use 
lossless compression to shrink their storage footprint. Games typically 
read files from the disk, using the CPU to decompress the game image. 
Decompressing game data from an HDD takes only a few CPU cores. 
Nevertheless, decompressing data from modern SSDs may consume 
a substantial amount of CPU resources.| 23] 
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To address this problem, NVIDIA comes with RTX IO, which is a 
suite of technologies that enable rapid GPU-based loading and decom- 
pression of games and creative assets, accelerating I/O performance. 
When used with Microsoft’s DirectStorage for Windows API, RTX IO 
offloads dozens of CPU cores’ worth of work to the GPU. Specifically, 
RTX IO allows the reads coming through DirectStorage to remain com- 
pressed and delivered to the GPU for decompression; this removes the 
load from the CPU and moves the data from storage to the GPU more 
efficiently and in compressed form, improving I/O performance.|23 | 

RTX IO improves game frame rates and enables near-instantaneous 
game loading. Also, some artifacts, like object pop-in and stutter, can 
be reduced, and high-quality textures can be streamed at an increased 
rate. RTX IO also benefits artists and designers working with large 
datasets such as those used for rendering complex CAD models or 3d 
scientific visualizations.[23 | 
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The processor’s speed increases as the clock frequency and the num- 
ber of transistors increase. This has an important side effect, the heat 
dissipated by processors. To overcome this problem, instead of in- 
creasing the clock frequency, the processor designers came with a 
new paradigm, multicore processors; with it, an application can run 
multiple threads of execution (sequence of instructions) simultane- 
ously. This concept is called parallel computing. Parallel computing 
can significantly increase performance for large applications in various 
areas like artificial intelligence and machine learning, physics, biology, 
economics, and many others. [28 |] 

GPUs (Graphics Processing Units) exemplify a specialized, mas- 
sively parallel processor. They can be used not only for computation 
in computer graphics or image processing but also for acceleration in 
applications that are traditionally handled just by the CPU (Central 
Processing Unit). This process of using a GPUs alongside a CPU to 
perform computation in areas other than graphics is called GPGPU 
(general-purpose computing on a GPU). 

GPU has several benefits; GPUs provide significantly higher in- 
struction throughput and memory bandwidth than the CPU. Many 
applications leverage these higher capabilities to run faster on the GPU 
than on the CPU. The difference in capabilities between these device 
types exists because they are designed with different goals. CPU is de- 
signed to excel at executing a sequence of operations, called a thread, 
as fast as possible and can execute a few tens of these threads in paral- 
lel. The GPU can execute thousands of parallel threads, amortizing 
the slower single-thread performance to achieve greater throughput, 
meaning that the GPU specializes in high parallel computations. De- 
voting more transistors to data processing, e.g., floating-point compu- 
tations, is beneficial for highly parallel computations. The GPU can 
hide memory accesses latencies with computation instead of relying 
on large data caches and complex flow control to avoid long memory 
access latencies. [29 | 

Applications with a high degree of parallelism can exploit this 
massively parallel nature of a GPU to achieve higher performance 
than a CPU.[29] There are several tools available to develop a general- 
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purpose application for GPUs. The two presented below are CUDA 
(Compute Unified Device Architecture) and OpenCL (Open Comput- 
ing Language). CUDA is a parallel computing platform and program- 
ming model developed by NVIDIA for GPGPU programming.[30] 


OpenCL is a framework for writing programs that execute across 
heterogeneous platforms, i.e., CPUs and GPUs.[31] 


5.1 CUDA 


CUDA is a general-purpose parallel computing platform and program- 
ming model that leverages the parallel compute engine in NVIDIA 
GPUs to solve many complex computation problems more efficiently 
than on a CPU.[29] CUDA comes with a software environment that al- 
lows developers to use C++ as a higher-level programming language. 
Other languages, application programming interfaces, or directives- 
based approaches are supported, such as FORTRAN, DirectCompute, 
and OpenACC.[29] In this chapter, we will focus on features available 
for CUDA 7.x and NVIDIA GPUs of compute capability 5.x and newer. 


5.1.1 Programming Model 


The CUDA programming model assumes that CUDA threads execute 
on a physically separate device that operates as a coprocessor to the 
host running the C++ program. For example, kernels execute on a 
GPU, and the rest of the C++ program executes on a CPU.[29] 

The CUDA programming model also assumes that both host and 
device maintain their own separate memory spaces in DRAM, referred 
to as host memory and device memory, respectively. [29] 


Kernels: 


CUDA C++ extends C++ by allowing the programmer to define C++ 
functions, called kernels. When the host calls a kernel, it is executed 
N times in parallel by N different threads on a GPU. [29] 

A kernel is defined using the __global__ declaration specifier, and 
the number of threads that execute that kernel for a given call is 
specified using < < <...> > > execution configuration syntax.[29 | 
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Bellow, we show an example of a kernel that performs the addition 
of vectors d_a and d_b and stores the result in d_c. Note that vec_add 
kernel will be executed on grid_dim * block_dim threads. Therefore, 
if we want to get the correct sum, these variables need to be set so that 
their product will be equal or greater than vec_length. 


__global__ void vec add(int* a, int* b, int* c, 
int vec length) 
1 
int i - threadIdx.x * blockIdx.x * blockDim. 
X; 
if (i < vec_length) 
c[i] = a[i] + b[i]; 


int main() 


{ 
int block_dim = ...; 
int grid_dim = ...; 
vec_add <<<grid_dim, block_dim>>> (d_a, d_b, 
d_c, vec_length); 
} 
Thread Hierarchy: 


Thousands of threads typically execute a kernel. These threads are 
organized into one-, two- or three-dimensional groups, called thread 
blocks. All the threads of a thread block reside on the same SM 
(Streaming Multiprocessor). Consequently, these threads can cooper- 
ate by sharing data through shared memory and synchronize their 
execution to coordinate memory accesses. However, due to the limited 
memory resources of an SM, there is a limited number of threads per 
block (1024 for most NVIDIA GPUs).[29] 

Multiple equally shaped thread blocks can execute a kernel. These 
blocks form a one-, two- or three-dimensional grid; grids are executed 
on a whole GPU independently, meaning that multiple thread blocks 
from one grid can run concurrently on different SMs and in any order. 
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The number of thread blocks in a grid is usually directed by the size of 
the processed data, which typically exceeds the number of processors 
in the system. [29] 

When we call a kernel, we need to define the number of threads 
per thread block and the number of blocks per grid on which this 
kernel executes. We specify these parameters in kernel call’s execution 
configuration; the first parameters specify the dimensions of the grid 
and the second dimensions of blocks. It can be either integer or dim3 


type.[29 | 


kernel«««blocks per grid, threads per block 
5552.5 3 


3 


dim3 is a vector type that takes one, two, or three parameters 
(default value is 1); with it, we can define the size of each dimension. 


/* 512 (873) threads per block. */ 
dim3 threads per block(8, 8, 8); 
/* 8 (2*4) blocks per grid. */ 
dim3 blocks per grid(2,4); 


Each kernel has pre-define variables of dim3 type. We can use 
these to get thread/block index or thread/block dimension. For ex- 
ample, we get a thread's indexes in a three-dimensional grid with 
three-dimensional thread blocks: 


int i x = threadIdx.x + blockDim.x * blockIdx.x; 
int i y threadIdx.y * blockDim.y * blockIdx.y; 
int iz threadIdx.z + blockDim.z * blockIdx.z; 


5.1.2 Hardware Implementation and SIMT architecture 


The NVIDIA GPU architecture is built around a scalable array of mul- 
tithreaded SMs (Streaming Multiprocessors). When a CUDA program 
on the host CPU invokes a kernel grid, the thread blocks of the grid 
are enumerated and distributed to multiprocessors with available exe- 
cution capacity. The threads of the thread block execute concurrently 
on one SM. As thread blocks terminate, new blocks are launched on 
the vacated SMs.[29] 

An SM is designed to execute hundreds of threads concurrently. 
To manage such a large amount of threads, it employs a unique archi- 
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tecture called SIMT (single instruction, multiple threads). The instruc- 
tions are pipelined, leveraging instruction-level parallelism through 
simultaneous hardware multithreading. The SIMT architecture is akin 
to SIMD (single instruction, multiple data) vector organization in that 
a single instruction controls multiple processing elements. However, 
SIMD uses Execution Units or Vector Units, while SIMT expands it 
to leverage threads. More specifically, in SIMT, multiple threads per- 
form the same instruction on different data sets (similar to SIMD, but 
with threads), but unlike SIMD, it also enables programmers to write 
thread-level parallel code for independent, scalar threads, as well as 
data-parallel code for coordinated threads.[29, 32] 

Although SIMT enables thread-level parallelism, it can severely 
affect performance. Because the SM creates, manages, schedules, and 
executes threads in a group of 32 parallel threads, called warps. Indi- 
vidual threads composing a warp start together at the same program 
address, but they have their own instruction address counter and reg- 
ister state and are free to branch and execute independently. However, 
a warp executes one common instruction at a time, so full efficiency 
is realized when all 32 threads of a warp agree on their execution 
path, meaning they always execute the same instruction. If threads of 
a warp diverge, i.e., via data-dependent conditional branch, the warp 
executes each branch path taken, disabling threads that are not on the 
path.| 29] 


5.1.3 Memory Hierarchy 


CUDA threads may access data from multiple memory spaces on the 
device during execution. Each thread has its own local memory, not 
to be confused with registers. Each thread block has shared memory 
visible to all threads of that block, with the same lifetime as the block. 
All threads have access to the same global memory. There are also 
two additional read-only memory spaces accessible by all threads, the 
constant and texture memory spaces; alongside global memory, they 
are persistent across kernel launches by the same application.|29 | 
The throughput of memory accessed by a kernel can vary sub- 
stantially, depending on access patterns for each type of memory. 
Therefore, we need to minimize data transfers between host and de- 
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vice, and plan and optimize data placement and access patterns within 
a device memory. [29 | 


Global Memory: 


Global memory resides in device memory, which is accessed via 32-, 
64-, or 128- byte memory transactions. These transactions must be 
naturally aligned; only the 32-, 64-, or 128-byte segments of device 
memory aligned to their size can be read or written by a memory 
transaction. [29 | 

When a warp executes an instruction that accesses global memory, 
it coalesces the memory accesses of the threads within the warp into 
one or more of these memory transactions, depending on the size of 
the word accessed by each thread and the distribution of the memory 
addresses across the threads. Generally, the more transactions are 
necessary, the more unused words are transferred in addition to the 
words accessed by the threads, reducing the instruction throughput 
accordingly. For example, if a 32-byte memory transaction is generated 
for each thread's 4-byte access, throughput is divided by 8.[29 | 

In the following code sample, we demonstrate device memory 
allocation and copying of data (which contain matrices) from the 
host to a device. Allocation is typically done with cudaMalloc() and 
freeing with cudaFree(). To copy data between host and device, we 
use cudaMemcpy(). Then we launch a simple kernel that performs 
matrix multiplication (C = A * B). At the end we need to release the 
memory we allocated. This is done with cudaFree(). 
template< typename T > 
__global__ void multiply matrices naive kernel(T 

* A, T* B, T* C, dim3 dim A, dim3 dim B, dim3 

dim C, size t max dim) 


1 
int ty id = threadIdx.y + blockDim.y * 
blockIdx.y; 
int tx id = threadIdx.x + blockDim.x * 
blockIdx.x; 
if (ty ld >= dim C.y |] tx id >= dim C.x) 
return; 
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T curr_cell = 0; 


für (size t i = 0; i < max dim; ++i) 
if(i < dim A.x && ty id « dim A.y && 
tx id « dim B.x && i « dim B.y) 
Curr cell += Ali + dim A. x * ty td] 
* B[tx id + dim B.x * i]; 


C[tx id + dim C.x * ty 1d] = curt cell: 


/* Multiplies matrices A and B, stores the 
product in C. #7 

template< typename T > 

void multiply matrices naive(matrix t«T»& A, 
matrix t«T»& B, matrix t«T»& C) 

{ 

size_t max_dim = std::max({ B.dim.x, B.dim.y 
y; A.dim.x, A.dim.y }); 


/* Allocate matrices in device memory */ 
Tx dA, * dB, * d C; 

cudaMalloc(&£d A, A.mem size); 
cudaMalloc(&£d B, B.mem size); 
cudaMalloc(&£d C, C.mem size); 


/* Copy matrices A and B from host memory to 
device memory. */ 
cudaMemcpy (d A, A.data, A.mem size, 
cudaMemcpyHostToDevice); 
cudaMemcpy(d B, B.data, B.mem size, 
cudaMemcpyHostToDevice); 


/* Execution parameters. */ 
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dim3 thread_block_dim(32, 32); 
dim3 grid_dim(std::ceil(static_cast <double>( 
C.dim.x) / thread_block_dim.x), 
std::ceil(static_cast<double>(C.dim.y) / 
thread_block_dim.y)); 


multiply_matrices_naive_kernel<T> <<< 
grid dim, thread_block_dim>>>(d_A, d B, 
d-C, A.dim, B.dim, C.dim, max dim); 


/* Copies matriz. C containing product of A 
and B from device memory to host memory */ 

cudaMemcpy (C.data, d C, C.mem size, 
cudaMemcpyDeviceToHost); 


/* Clean up. */ 
cudaFree(d A); 
cudaFree(d B); 
cudaFree(d C); 


Constant Memory: 


Constant memory is a relatively small dedicated memory space in 
device memory and is cached in the constant cache. The main differ- 
ence from global memory is that constant memory is the most efficient 
when all threads sending a request, access the value at the same ad- 
dress simultaneously. A request is split into as many separate requests 
as there are different memory addresses in the initial request. [29] 


Texture and Surface Memory: 


CUDA supports a subset of the texturing hardware that the GPU 
uses for graphics to access textures and surface memory. Texture and 
surface memory spaces reside in device memory and are cached in 
texture cache, which is optimized for a 2-dimensional spatial locality. 
Threads of the same warp that read texture or surface addresses that 
are close together in 2-dimensional space will achieve the best perfor- 
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mance. It also offers other features, like addressing modes and data 
filtering.[29] 


Local Memory: 


Each thread has its own local memory, and because it is part of device 
memory, it has the same restrictions as global memory. However, local 
memory is organized such that consecutive threads access consecutive 
32-bit words. Therefore, accesses are fully coalesced as long as all 
threads in a warp access the same relative address. By default, the 
compiler decides which data will be stored in registers and which will 
fall into local memory. Registers have much lower latency and higher 
bandwidth than local memory, but each thread has a limited amount 
of them. For example, large structures or arrays are likely to be placed 
into local memory.|29 | 


Shared Memory: 


Shared memory is equivalent to user-managed cache; the application 
explicitly allocates and accesses it, typically to minimize global mem- 
ory accesses. Because shared memory is on-chip, it has much higher 
bandwidth and much lower latency than local or global memory. 
Shared memory is divided into equally-sized modules called banks 
to achieve high bandwidth. Banks can be accessed simultaneously. 
Any memory read or write request made of N addresses that fall into 
N distinct memory banks can be accessed simultaneously, yielding 
an overall bandwidth that is N-times as high as the bandwidth of 
a single module. However, if addresses of memory requests fall in 
the same memory bank, there is a bank conflict, and the access has 
to be serialized. It is essential to avoid bank conflicts to get the best 
performance.[29] 

The NVIDIA GPUs of Maxwell and newer microarchitecture con- 
tain 32 shared memory banks organized such that successive 32-bit 
words map to successive banks.[29] In the example below, the second 
write into s_vec will generate bank conflict because each thread of a 
warp with index N writes into the same bank as a thread with index 
N+16. However, if multiple threads access the same word from the 
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same bank, the word will be broadcasted to all the threads; therefore, 
access will be conflict-free. 


__shared__ int32 t s vec[256]; 


/* No bank conflict. */ 
s vec[threadIdx.x] - 


/* 2-way bank conflict. */ 
s vec[threadIdx.x * 2] - 


/* No bank conflict. */ 
s vec[threadIdx.x * 3] - 


/* No bank conflict. */ 
int32 t a = s vec[0]; 


There are two ways of allocating shared memory, either dynami- 
cally as we did in the example above; we specify the size per thread 
block with a third execution configuration parameter. Then we declare 
the shared memory array as an unsized array with extern shared 
specifiers (i.e., extern shared float arr[ |) within a kernel. The size 
is implicitly determined from the kernel's execution configuration 
parameter. We can use this method when the amount of shared mem- 
ory required is not known at the compile time. If the size is known 
at compile time, we can use static shared memory, which is allocated 
inside a kernel with — shared  specifier and size (i.e. shared . 
float arr[10] ). 

In the following code example, we will change the matrix mul- 
tiplication kernel from before to take advantage of shared memory. 
However, we generally cannot store the whole matrices in shared mem- 
ory due to its small size compared to global memory. However, we can 
divide the matrices into smaller sub-matrices (tiles) that fit into shared 
memory. In kernel bellow, each thread block computes a sub-matrix of 
C, with tiles from matrix A and B stored in shared memory. First, each 
thread copies one element from A and B into matrices s A and s B 
stored in shared memory, then computes one element of the product 
and accumulates it into curr cell. This continues for all the tiles in the 
corresponding rows/columns of A and B. 
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template< typename T > 

__global__ void multiply_matrices_shared_kernel ( 
T* A, T* B, T* C, size t tile size, dim3 dim A 
, dim3 dim B, dim3 dim C, size t max dim) 


int global ty id = threadIdx.y + blockDim.y 
* blockIdx.y; 

int global tx id = threadIdx.x + blockDim.x 
* blockIdx.x; 


/* Pointer to begining of dynamically 
allocated shared memory. */ 

extern | shared | T shared_mem_ptrl[]; 

/* Split the shared memory into two parts, 
one for each block size v block size 
matriz. s A and s B are sub-matrices of A 
and B, respectively. */ 


T* s A = shared mem ptr; 

T* s B = &s A[tile size * tile sizel; 

T curr cell = 0; 

for (size t stride = 0; stride * tile size < 


max dim; ++stride) 
1 
/* Thread indices for matriz A in 
current loop iteration */ 
Size t A tx id = stride * tile size + 
threadIdx.x; 
size t A ty id = global ty id; 


/* Thread indices for matriz B in 
current loop iteration */ 

size t B tx id - global tx id; 

size t B ty id - stride * tile size -* 
threadIdx.y; 
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/* Threads id in thread block */ 
Size t t id = threadldx.x + blockDim.x * 
threadldx.y; 


/* Fill sub-matrices in shared memory. 
xf 
if (A tx id < dim A.x && A ty id < 


dim A.y) 
s A[t id] = A[A tx id + dim A.x * 
A ty id]; 
else 


s A[t id] = 0; 
if( B tx id « dim B.x && B ty id « dim B 


s B[t id] = B[B tx id + dim B.x * 


/* Wait until the s A and s B is filled 
by all threads in block. */ 
__syncthreads(); 


/* Perform a matriz multiplication of 
currently loaded sub-matrices.x*/ 
if (global tx id < dim C.x 4 
global ty id « dim C.y) 
für (size t i = 0; i < tile size; ++ 
i) 
curr_cell += s_A[i + tile_size * 
threadldx.y] * s_B[threadIdx. 
x + tile size * il; 


/* Wait until all the threads of current 
thread block finish multiplying their 
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sub-matrices, before loading a new 
sub-matrices. */ 
__syncthreads(); 


if (global tx id < dim C.x && global ty id < 
dim C.y) 
C[global tx id + dim C.x * global ty id] 
- curr cell; 


/* Multiplies matrices A and B, stores the 
product in C. */ 

template< typename T > 

float multiply_matrices_tiled(matrix_t<T>& A, 
matrix_t<T>& B, matrix_t<T>& C, size t 
tile size) 


1 
size_t size shared memory = 2 * tile size * 
tile size * sizeof(T); 
multiply matrices tiled kernel«T» ««« 
grid dim, thread block dim, 
size shared memory >>> (dA, d B, d C, 
tile size, A.dim, B.dim, C.dim, max dim); 
} 


5.1.4 Asynchronous Concurent Execution 


CUDA exposes several operations as independent tasks that can oper- 
ate concurrently. These include computation on host/device, memory 
transfers between host and device, and memory transfers within a 
given device. However, the level of concurrency achieved between 
these operations will depend on the feature set and compute capability 
of the device.[29] 
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Concurrent execution between host and device is facilitated through 
asynchronous library functions that return control to the host thread 
before the device completes the requested task. Many device opera- 
tions can queue up together, using asynchronous calls; they will be 
executed by the CUDA driver when appropriate device resources are 
available. For example, kernel launches are asynchronous in respect 
to the host by default. This can be changed with the CUDA_LAUNCH 
_BLOCKING environment variable.[29 | 


Streams: 


Applications manage the concurrent operations through streams. A 
stream is a sequence of commands (possibly issued by different host 
threads) that execute in order. However, different streams may execute 
their commands concurrently or out of order with respect to one 
another. [29 | 

A stream is defined by creating a stream object and specifying it as 
a stream parameter to kernel launches and memory copies. [29] In the 
case of kernel launches, we specify the stream as the fourth execution 
configuration parameter. 


cudaStream_t streams [2]; 


cudaStreamCreate(% streams [0]); 
cudaStreamCreate(% streams [1]); 


vec add «««grid dim, block dim, 0, streams[0]>>> 
(da, db, d c, vec length); 

vec add «««grid dim, block dim, 0, streams[1]>>> 
(dd, de, d f, vec length); 


Default stream is used if kernel launches or memory copies do not 
specify any stream parameter. 
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5.2 OpenCL 


OpenCL (Open Computing Language) is an open royalty-free stan- 
dard for general purpose parallel programming across CPUs, GPUs, 
and other processors, giving software developers portable and effi- 
cient access to power of these heterogenous processing platforms (in 
contrast, CUDA only supports programming on NVIDIA GPUs).[33 | 

OpenCL consists of an API for coordinating parallel computation 
across heterogeneous processors, a cross-platform programming lan- 
guage, a cross-platform intermediate language with a well-specified 
computation environment, libraries, and a runtime system. [33] In this 
chapter, we will focus on GPU programming with OpenCL 1.2 on 
NVIDIA GPUs of compute capability 5.x and newer. 


5.2.1 OpenCL programming model and its comparison to CUDA 


An OpenCL application, similarly to CUDA, is implemented as both 
host code and device kernel code. The host code portion of an OpenCL 
application runs on a host processor according to the models native to 
the host platform. The application host code submits the kernel code 
as commands to OpenCL devices for execution. [33] Each device is a 
collection of one or more compute units, where each compute unit is 
composed of one or more processing elements. Processing elements 
execute code with SIMD (Single Instruction Multiple Data) or SPMD 
(Single Program Multiple Data) parallelism. In our case of a device 
being a GPU, compute units would correspond to the SMs (Streaming 
Multiprocessors) inside the GPU, and processing elements correspond 
to individual streaming processors inside each SM.[34] A kernel code 
uses OpenCL C programming language based on C99 specifications, 
with extensions and restrictions to support parallel kernels. [35 ] 


Kernel and Thread Hierarchy: 


When a kernel enqueue command submits a kernel for execution, 
an index space is defined. The kernel, the values associated with the 
arguments to the kernel, and the parameters that define the index 
space define a kernel instance. When a kernel instance executes on a 
device, the kernel function executes each point in the defined index 
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space. Each of these executing kernel functions is called a work item. 
The device manages the work items associated with a given kernel 
instance in groups called work-groups. These work-groups define a 
coarse-grained decomposition of the Index space, that is called an 
NDRange.[33] 

Work-items are akin to CUDA threads and are mapped onto one-, 
two-, or three-dimensional NDRange and work-groups. Where NDRange 
represents all the work-items launched and work-group is a collection 
of work-items accommodated on a single compute unit, i.e., SM.[34] 
This grouping mechanism differs from CUDA, where a grid is viewed 
as a collection of thread blocks (work-groups), not threads (work- 
items). 


5.2.2 OpenCL Memory Hierarchy and its comparison to CUDA 


Like in CUDA, the OpenCL memory model assumes that the host and 
device have separate memory spaces. Device memory, accessible from 
kernels, consists of four address spaces or memory regions. First, the 
global memory allows access to all work-items in all work-groups run- 
ning on any device within a context. The Second is constant memory, 
a region of global memory that remains constant during the execution 
of a kernel instance. The next two regions are a local memory, which 
is local to a work-group (akin to CUDA shared memory), and private 
memory, a region private to work-item (this corresponds to CUDA 
local memory/registers).[33] 

We can use address space qualifiers, global, _ local, constant, 
and. private to specify the memory region used to allocate a memory 
object. The default address space name for arguments to a kernel in a 
program or local variables of a kernel is — private.[33] If we want to 
use a dynamic size of local memory for a kernel, we need to make a 
pointer or an array argument for the kernel with — local qualifier. We 
can set the size as described in Chapter 5.2.3. 

The contents of global memory are memory objects. Memory ob- 
jects are handles to a reference counted regions of global memory and 
use the OpenCL type cl mem and fall into two distinct classes. First is 
a buffer, a memory object stored as a block of contiguous memory and 
used as a general-purpose object to hold data. The second is an image, 
which holds one-, two-, three- dimensional images, whose formats 


40 


5. GPU PROGRAMMING 


are based on standard image formats used in graphics applications. 
An image is an opaque data structure managed by functions defined 
in the OpenCL API. 

We can create three buffers in device memory, which represent 
matrices as follows: 


/* Create device buffers. */ 

cl_mem d_A = clCreateBuffer(context_gpu, 
CL MEM READ ONLY, A size, NULL, &ret); 

cl mem d B = clCreateBuffer(context gpu, 
CL MEM READ ONLY, B size, NULL, &ret); 

cl mem d C = clCreateBuffer(context gpu, 
CL MEM READ WRITE, C size, NULL, &ret); 


5.2.3 The OpenCL Framework 


The OpenCL framework allows applications to use a host and one or 
more OpenCL devices as a single heterogeneous parallel computer 
system. The framework contains the following components:[33] 


e OpenCL Platform layer: The platform layer allows the host pro- 
gram to discover OpenCL devices and their capabilities and 
create contexts.[33 ] 


e OpenCL Runtime: The runtime allows the host program to ma- 
nipulate contexts once they have been created. [33] 


e OpenCL Compiler: The OpenCL compiler creates a program 
executable that contains OpenCL kernels. The OpenCL compiler 
may build program executable from OpenCL C source strings; 
OpenCL C supports a subset of the ISO C99 language with ex- 
tensions for parallelism.[33] 


Platform: 


A platform is a host plus a collection of devices managed by the 
OpenCL framework that allows an application to share resources 
and execute kernels on devices in the platform. [33] It corresponds to 
a specific OpenCL implementation, i.e. NVIDIA CUDA, Intel OpenCL. 
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Unlike in CUDA, in OpenCL, we need to specify which platform we 
want to use (because OpenCL is a cross-platform standard). 

We can obtain a list of available platforms with the function clGet- 
PlatformIDs and the platform info with clGetPlatformInfo. In our 
matrix multiplication example, we obtain a list of platforms and then 
query their names: 


/* Get number of available platforms. */ 
cl_uint num_platforms = 0; 
clGetPlatformIDs(NULL, NULL, &num_platforms) ; 


/* Get the platforms. */ 
cl_platform_id* platform_ids = new 
cl platform id[num platforms]; 
clGetPlatformIDs(num platforms, platform ids, & 
num platforms); 


/* Print available platform names. */ 
for (size t j = 0; j < num platforms; ++j) 
{ 
const size t MAX NAME SIZE = 100; 
char platform name[MAX NAME SIZE]; 
size_t name size = 0; 


ret = clGetPlatformInfo(platform ids[jl, 
CL PLATFORM NAME, MAX NAME SIZE, 
platform name, &name size); 


std::cout << "platform, number, y" << j << "y 
burg 
for (size t i= 0; i < name size; ++i) 
std::cout << platform name[il; 
std::cout << std::endl; 


After acquiring the list of platforms, we can query available devices 
on these platforms with the clGetDeviceIDs(). In our example, we 
query available GPUs. 


/* Get number of devices. */ 
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cl_uint num_devices; 
clGetDeviceIDs(curr platform id, 
CL DEVICE TYPE GPU, NULL, NULL, £&num devices); 


/* Get available devices. */ 

cl device id* device ids = new cl device id[ 
num devices]; 

clGetDeviceIDs(curr platform id, 
CL DEVICE TYPE GPU, num devices, device ids, 
NULL); 


Context: 


Context is the environment within which the kernels execute and the 
domain defining synchronization and memory management. The con- 
text includes a set of devices, the memory accessible to those devices, 
the corresponding memory properties, and one or more command 
queues used to schedule the execution of kernels or operations on 
memory objects. [33] 

Context is created with clCreateContext(). In the call, we need to 
specify the devices we want to associate with the context and context 
properties, like which platform to use. 


/* List of context properties. */ 

const cl context properties context prop[3] = { 
CL CONTEXT PLATFORM, (cl context properties) 
curr platform id, 0}; 


/* Create a contezt. */ 

cl context context gpu = clCreateContext( 
context prop, num devices, &device id, NULL, 
NULL, &ret); 


Program and kernel objects: 


Once we have a context, we can create a program object from a pro- 
gram source or binary; a program consists of a set of kernels and 
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may also contain auxiliary functions and constant data. Our program 
needs to be compiled if we want to generate kernel objects. Compila- 
tion can be done online or offline; in offline compilation, the kernel 
is pre-built with OpenCL compiler, and the OpenCL API loads the 
generated binary. Since the kernel binary is already built, the time lag 
between starting host code and executing kernel is smaller than an 
online compilation. However, the problem with offline compilation is 
that in order to execute the program on various platforms, multiple 
kernel binaries must be included. In online compilation, the kernel is 
built from source code during runtime using OpenCL runtime library, 
i.e., clIBuildProgram(). This method is commonly known as just-in- 
time compilation. The advantage is that the host side binary can be 
distributed in a form that’s not device-dependent and that adaptive 
compiling of the kernel is possible. [33, 36] 

In our example, we create a program for our context_gpu context 
from a source string, build it, and create a kernel object. 


/* Create a program object and build it. */ 
cl program program gpu 1 = 
clCreateProgramWithSource(context gpu, 1, & 
kernel source const, NULL, &ret); 
clBuildProgram(program gpu 1, 1, &device id, 
NULL, NULL, NULL); 
/* Create a kernel object from 
matriz multiplication local kernel. */ 
cl kernel matrix multiplication local kernel - 
clCreateKernel(program gpu 1, " 
matrix multiplication local", Eret); 


A kernel object encapsulates a specific kernel function declared in a 
program with — kernel qualifier, and the argument values used when 
executing this kernel. The argument values and sizes are specified 
using the clSetKernelArg() for every argument. [33] 

Let us consider a kernel: ia 


__kernel void matrix multiplication local( 
. global int* A, _ global int* B, _ global int 
* C, unsigned int dim A x, unsigned int 
dim A y, unsigned int dim B x, unsigned int 
dim B y, unsigned int max dim, . local int* 
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S A, __local int* s B, unsigned int block size 


) 


We can set its argument's value and size: 


/* Set kernel arguments. */ 

clSetKernelArg( 
matrix multiplication local kernel, 0, sizeof( 
cl mem), &d A); 

clSetKernelArg( 
matrix multiplication local kernel, 1, sizeof( 
cl mem), &d B); 


clSetKernelArg( 
matrix multiplication local kernel, 9, 
block size * block size * sizeof(T), NULL); 
clSetKernelArg( 
matrix multiplication local kernel, 10, sizeof 
(unsigned int), &block size); 


These parts of the OpenCL framework significantly differ from 
CUDA. In CUDA, kernels are treated as functions, defined using CUDA 
C++ language extensions, and can be compiled together with the host 
code (with NVIDIA CUDA compiler). However, in OpenCL, kernels 
are objects that need to be compiled separately (online or offline) and 
are launched and configurated using OpenCL API. 


Command Queues: 


A command is an OpenCL operation submitted to a command queue 
for execution. For example, commands issue kernels for execution on 
compute device and manipulate memory objects.[33 | 

OpenCL objects such as memory, program, and kernel objects are 
created using a context. Operations on these objects are performed 
using an object called command queue. The command queue can be 
used to queue a set of commands in-order, but they may be executed 
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in-order or out-of-order. Applications can have multiple command 
queues to queue independent commands without requiring synchro- 
nization. Note that this works as long as these objects are not being 
shared. Sharing of objects across multiple command queues will re- 
quire appropriate synchronization.[33] 

Command queue is of cl_command_queue type and is created 
with clCreateCommandQueue(). In the example below, we create a 
command-queue and then enqueue commands that write into buffers 
from host memory and execute a kernel. 


/* Create a command queue. */ 

cl command queue cq gpu 1 = clCreateCommandQueue 
(context gpu, device id, 0, &ret); 

/* Copy matrices from host to device memory. */ 

clEnqueueWriteBuffer(cq gpu 1, d A, CL TRUE, 0, 
A size, h A.matrix, 0, NULL, NULL); 

clEnqueueWriteBuffer(cq gpu 1, d B, CL TRUE, 0, 
B size, h B.matrix, 0, NULL, NULL); 

clEnqueueWriteBuffer(cq gpu 1, d C, CL TRUE, 0, 
C size, h C.matrix, 0, NULL, NULL); 


/* Enqueue the kernel for execution. */ 
clEnqueueNDRangeKernel(cq gpu 1, 
matrix multiplication local kernel, dimensions 
, NULL, ND range size, work group size, O, 
NULL, NULL); 
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In this chapter, we will look at new major features of CUDA versions 
from CUDA 8 to CUDA 11, and on what new capabilities they in- 
troduce. Each new major version of CUDA usually accompanies a 
new generation of NVIDIA GPUs, adding tools for programmers to 
leverage the GPUs’ new hardware capabilities or expand the existing 
ones. 


6.1 CUDA 8 


A crucial goal for CUDA 8 is to support the Pascal microarchitec- 
ture. This includes support for native FP16 and INT8 computations 
for deep learning and other workloads and new Unified Memory ca- 
pabilities. While also presenting a new library for GPU-Accelerated 
Graph Analytics called nvGRAPH and adding experimental support 
for heterogeneous C++ lambdas.[37] 


6.1.1 Unified Memory 


UM (Unified Memory) is a component of the CUDA programming 
model, first introduced in CUDA 6.0, defining a managed memory 
space in which all processors (both CPUs and GPUs) see a single 
coherent memory image with a shared address space.[29] This means 
that managed memory is accessible to both the CPU and GPU using a 
single pointer. The system automatically migrates data allocated in 
UM between host and device, so it looks like CPU memory to code 
running on the CPU and like GPU memory to code running on the 
GPU [37] 

CUDA 6 UM was limited by the Kepler and Maxwell GPU mi- 
croarchitecture; all managed memory touched by the CPU had to be 
synchronized with the GPU before any kernel launch. The CPU and 
GPU could not simultaneously access a managed memory allocation, 
and the UM address space was limited to the size of GPU physical 
memory.|[37 | 

CUDA 8 expands the benefits of CUDA 6 Unified Memory, and 
new features of Pascal and newer GPUs further simplify programming 
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and sharing of memory between CPUs and GPUs. Two main hardware 
features enable these improvements: larger address space and page 
faulting capability. [37] 

Support for larger virtual address space combined with page fault- 
ing means that the CUDA system does not need to synchronize all 
managed memory allocation to the GPU before kernel launch. Instead, 
suppose a kernel running on the GPU accesses a page not resident in 
its memory. In that case, it faults, allowing the page to be automatically 
migrated to the GPU memory on-demand. Moreover, a large virtual 
address space enables applications to access the virtual memory of an 
entire system, meaning apps can oversubscribe the memory system; 
they can allocate, access, and share arrays larger than the total physical 
capacity of the system.| 37] 

Managed memory can be allocated either with cudaMallocMan- 
aged() routine or as a variable with __managed__ keyword as before, 
and now on supporting platforms, it can also be allocated with default 
system allocator (e.g., malloc() or new).[29] 

In the example below we want to calculate vectors m_c = m_a 
+ m_b and m_f = m_d + m_e. We would like to save some time 
by initializing vectors m_d, m_e with the host function, while the 
first vec_add kernel is calculated. This would be illegal with CUDA 
6 Unified Memory, even though the first kernel launch does not use 
vectors m_d, m_e. 


int* ma, * mb, * mc, * md, * me, * m f; 


cudaMallocManaged(&£m a, vec size mem); 
cudaMallocManaged(&£m b, vec size mem); 
cudaMallocManaged(&£m c, vec size mem); 
cudaMallocManaged(&£m d, vec size mem); 
cudaMallocManaged(&£m e, vec size mem); 
cudaMallocManaged(&£m f, vec size mem); 


init vec(m a, vec size); 

init vec(m b, vec size); 

vec add «««grid dim, block dim»»» (m_a, mb, mec 
, vec size); 
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init_vec(m_d, vec_size); 

init_vec(m_e, vec_size); 

vec add <<<grid_dim, block_dim>>> (m_d, m_e, mf 
, vec_size); 

cudaDeviceSynchronize(); 


6.1.2 Mixed Precision Computing 


Mixed precision means combining different numerical precisions in a 
computational method. The Pascal microarchitecture adds features 
to provide even higher performance for applications that can utilize 
lower precision computation by adding vector instructions that pack 
multiple operations into a 32-bit data path. Specifically, these instruc- 
tions operate on 16-bit floating-point data (FP16) and 8- and 16-bit 
integer data (INT8/INT16). For example, DP4A instruction performs 
the vector dot product between two 4-element vectors A and B, each 
comprising four single-byte values in a 32-bit word, storing the re- 
sult in a 32-bit integer, and adding it to a third argument C, also a 
32-bit integer.[38] To enable programmers to write their own code 
using these data types, CUDA provides built-in data types (e.g. half 
and half2) and intrinsic for FP16 arithmetic (e.g. hadd(),  hmul(), 
. hfma2()) and new vector dot products that operate on INT8 and 
INT16 values (__dp4a(), __dp2a()).[37] 


6.2 CUDA9 


With the launch of Volta GPUs and their new Streaming Multipro- 
cessor now containing Tensor Cores and combined L1 data cache 
and shared memory subsystem, while also supporting independent 
thread scheduling, CUDA needed to provide programmers with tools 
to leverage these new hardware capabilities and features. These new 
tools include a new programming model for managing groups of 
communicating threads called Cooperative Groups and a new API 
for programming Tensor Core matrix multiply and accumulate opera- 
tions. CUDA 9 also adds support for C++14 in CUDA device code, 
new algorithms in cuSolver and nvGraph, and more.[39] 
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6.2.1 Cooperative Groups 


In parallel algorithms, threads often need to cooperate to perform 
collective computations. Building these cooperative codes requires 
grouping and synchronizing of the cooperating threads. Historically 
CUDA programming model has provided a single, simple construct 
for synchronizing threads: a barrier across all threads of a thread block, 
as implemented with __syncthreads().[39] 

Cooperative groups introduce the ability to define groups of threads 
explicitly at sub-block and multi-block granularity and to perform 
collective operations, such as synchronization on them. This allows 
developers to express the granularity at which threads communicate, 
helping them express richer and more efficient parallel decomposition. 
[29, 39] 

The main concept in cooperative groups is that of objects specifying 
the set of threads that are part of it. This expression of groups as objects 
improves software composition since collective functions can receive 
an explicit object representing the group of participating threads. 
There are two main group types: the implicit and explicit groups. 
The implicit groups represent the launch configuration of the kernel, 
like thread block and grid. The first type of explicit groups are thread 
block tiles that enable programmers to partition the thread blocks into 
groups of N threads, where 2 <= N <= 32 and N is the power of 2. It is 
also possible to obtain tiles of sizes 64, 128, 256, 512 using experimental 
API. The second type is coalesced groups, which can discover and 
create a group of all coalesced threads; coalesced threads are threads 
of a warp that remain active on the same path in a data-dependent 
conditional branch in application code. [29] 

Let us look at an example of a vector sum reduction code that 
will not use cooperative groups. Each thread loads size/ grid size 
elements from global memory and stores them into shared memory. 
Then threads of the thread block add two elements from shared mem- 
ory into one. This continues until only one value is left. Now we need 
to write the sum acquired by the current thread block into global 
memory, considering we can only synchronize thread blocks. We can 
either write the result of each thread block into a different address 
and then sum them after the kernel finishes execution in device code 
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or add it into global memory atomically. In this example, we will do 
the latter: 


template< typename T > 
__global__ void reduce_vec_naive_kernel(T* sum, 
T* vec, size_t size) 
{ 
extern __shared__ T s_vec[]; 
T curr_sum = 0; 


/* Indexes for two elems in vec that this 
thread will load. */ 

int global_tid = threadIdx.x + blockDim.x * 
blockIdx.x; 

int grid_size = gridDim.x * blockDim.x; 


/* Load sum of (vec_size / grid_size) elems 
from vec to shared memory. */ 
for (size_t i = global_tid; i < size; i t= 
grid_size) 
curr_sum += veclil; 
s vec[threadldx.x] = curr_sum; 


/* Wait until all the threads of thread 
block load their value into shared memory. 
*/ 

__syncthreads(); 


/* Reduce the vector in shared memory, each 
iterations halves the number of active 
threads/warps. */ 

for (size_t stride = blockDim.x / 2; stride 
l= 0; stride /= 2 ) 

{ 

if (threadIdx.x < stride) 
s vec[threadIdx.x] += s vecl 
threadIdx.x + stride]; 
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/* Wait until all the threads of thread 
block finish this iteration. */ 
__syncthreads () ; 


/* Thread 0 adds this thread block’s sum 
into global sum. */ 
if (threadIdx.x == 0) 
atomicAdd(sum, s_vec[threadIdx.x]); 


We modify vector sum reduction to one that takes advantage of 
cooperative groups. First, we define two groups, g_block, and g_grid, 
representing the current thread block and grid. Then we load ele- 
ments from global memory as before, but now we use thread_block’s 
and grid_group’s member functions to quarry thread index and grid 
size. We can also explicitly synchronize groups with sync() mem- 
ber function. After the whole thread block finishes loading curr_sum 
into shared memory, we call reduce block cg shfl() device helper 
function that reduces the current thread block. After that, each block 
writes its sum into a different address in global memory. Now, thanks 
to cooperative groups, we can synchronize across all the threads of 
the grid, which enables us to get the final sum in device code without 
relying on atomic functions. 

In our block reduction member function, we partition the thread 
block into tiles composed of 32 threads. To make our implementation 
faster, instead of performing a reduction in shared memory, we use a 
shuffle-down member function, which exchanges variables between 
threads without the use of shared memory. Then first thread of each 
tile writes its tile sum into shared memory, where it is reduced into a 
final sum of the current thread block. 


namespace cg = cooperative groups; 

template< typename T > 

__device__ void reduce_block_cg_shfl_device( 
const cg::thread_block& g block, T* s_vec, Té 
sum ) 


const unsigned int tile_size = 32; 
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/* Partition thread block to warp sized 
groups. */ 

cg::thread_block_tile<tile_size> g tile = cg 
::tiled partition«tile size»(g block); 


/* Perform reduction on current warp/tile. 
*/ 
T tile sum = s vec[g block.thread rank()]; 
for (size t stride - tile size / 2; stride » 
0; stride /= 2) 
tile sum *- g tile.shfl down(tile sum, 
stride); 


/* Write this tile’s sum into shared memory. 
*/ 

if(g_tile.thread_rank() == 0) 
s vec[g block.thread rank()] = tile sum; 


/* Wait for all tiles to finish their 
reduction. */ 
g_block.sync(); 


/* Reduce the sums of all tiles of this 
thread block into final sum. */ 
if (g_block.thread_rank() == 0) 
for (int stride = 0; stride < g tile. 
meta_group_size(); ++stride) 
sum += s_vec[stride * g_tile.size() 


UE: 


/* Wait until final sum is calculated. */ 
g_block.sync(); 


template< typename T > 
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__global__ void reduce vec cg kernel(T* sum, Tx 


1 


vec, const size t * size) 


extern | shared T s vecll; 
T curr sum = 0; 
T block sum = O0; 


cg::thread block g block - cg:: 
this thread block(); 
cg::grid group g grid = cg::this grid(); 


/* Load sum of (vec size / grid size) elems 
from vec to shared memory. */ 
for (size t i = g grid.thread rank(); i < * 
size; i *- g grid.size()) 
curr sum += veclil; 
s vec[g block.thread rank()] = curr sum; 


/* Wait until all threads from current 
thread block finish loading. */ 
g_block.sync(); 


/* Reduce values loaded by current thread 
block. */ 

reduce block cg shfl device(g block, s vec, 
block sum); 


/* Write block sum into global memory */ 
if (g block.thread rank() == 0) 
vec[g block.group index().x] = block sum 


/* Wait until all the block write their sums 


e 0x/ 
g grid.sync(); 
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/* Reduce the thread block sums into final 


sum. */ 

if (g_grid.thread_rank() == 0) 
1 

*sum = 0; 

for (int i = 0; i < € grid.group dim().x 

; ++i) 
*sum += vec[i]; 

} 


6.2.2 API for programming Tensor Core operations 


New TCs (Tensor Cores) are one of the most significant features of 
Volta microarchitecture. TCs and their associated data paths are de- 
signed to increase floating-point compute throughput; each TC pro- 
vides a 4x4x4 matrix processing array that performs the operation D = 
AxB +C, where A, B, C, and D are 4x4 matrices. The matrix multiply 
inputs A and B are FP16 matrices, while accumulation matrices C and 
D may be FP16 or FP32.[39 | 

Multiple Tensor Cores are used concurrently by a warp during pro- 
gram execution. The threads within a warp provide a 16x16x16 matrix 
operation to be processed by the TCs. CUDA 9 includes a CUDA C++ 
API for warp-level matrix-multiply and accumulate. These include 
specialized matrix load, matrix multiply and accumulate, and matrix 
store operations to efficiently utilize TCs in CUDA C++ programs. 
In addition to providing interfaces to program TCs directly, CUDA 9 
also includes specialized libraries like CUTLASS and cuBLAS, which 
provide GEMM routines for Tensor Cores or cuDNN that support 
convolution operation on Tensor Cores.[29, 39 | 


6.3 CUDA 10 


CUDA 10 was released alongside Turing microarchitecture, which 
expands on advances made in previous microarchitectures while also 
introducing new features. The new enhanced API and SDKs add op- 
erations that take advantage of second-generation Tensor Cores and 
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enable scaled-up NVLINK-powered GPU systems. CUDA 10 expands 
the existing CUDA programming model with a new asynchronous 
task-graph model, enabling more efficient launch and execution. It 
includes libcu++, a parallel standard C++ library for GPUs; it pro- 
vides a heterogeneous implementation of the C++ Standard Library 
that can be used in both the CPU and the GPU code. CUDA 10 also 
includes some performance optimizations in CUDA libraries and other 
features.[40, 41] 


6.3.1 CUDA Graphs 


Many HPC applications such as deep neural network training and 
scientific simulations gave an iterative structure where the same work- 
flow is executed repeatedly. CUDA streams require the work to be 
resubmitted with every iteration, consuming both time and CPU re- 
sources. Graphs present a new model for submitting work, where 
execution flow is defined once and can be run repeatedly. A graph 
consists of a series of operations, such as memory copies and kernel 
launches, connected by dependencies and defined separately from its 
execution. [40] 

Apart from the application where the same workflow is executed 
repeatedly, graphs can present performance benefits for kernels with 
a short runtime, where the overhead of a kernel launch can be a signif- 
icant fraction of the overall end-to-end execution time. In such cases, 
separating the definition of a graph from its execution reduces CPU 
kernel launch costs and can make a significant performance difference. 
Graphs also enable a CUDA driver to perform some optimizations 
because the whole workflow is visible, improving performance in 
various cases. [40] 

Graphs can be created either with explicit API or stream capture. In 
the following example, we illustrate the former method with a graph 
that starts by multiplying two vector pairs, reducing their products, 
and adding the two results together. First, we create an empty graph 
and empty nodes: 


/* Create an empty graph. */ 
cudaGraph t graph; 
cudaGraphCreate(&graph, 0); 


56 


6. CUDA VERSIONS 


cudaGraphNode t mult 1; 
cudaGraphNode t mult 2; 
cudaGraphNode t reduce 1; 
cudaGraphNode t reduce 2; 
cudaGraphNode t add; 


Then we can use CUDA API to configure the nodes and add them 
to the graph along with dependencies: 


cudaGraphAddKernelNode(&£mult 1, graph, nullptr, 
0, &node params 1); 

cudaGraphAddKernelNode(&£mult 2, graph, nullptr, 
0, &node params 2); 

cudaGraphAddKernelNode(&£reduce 1, graph, nullptr 
, 0, &node params 3); 

cudaGraphAddKernelNode(£reduce 2, graph, nullptr 
, 0, &node params 4); 

cudaGraphAddHostNode(&£add, graph, nullptr, 0, & 
node params 5); 


cudaGraphAddDependencies(graph, &mult 1, & 
reduce 1, 1); 

cudaGraphAddDependencies(graph, &mult 2, € 
reduce 2, 1); 

const cudaGraphNode t from[] = { reduce 1, 
reduce 2 }; 

const cudaGraphNode t tol] = í final sum, 
final sum }; 

cudaGraphAddDependencies(graph, from, to, 2); 


The fifth argument of cudaGraphAddKernelNode() is cudaKer- 
nelNodeParams, which specifies the kernel launched and its argu- 
ments, thread block dimension, dynamic shared memory size, and 
other parameters. 

We can build an identical graph using stream capture, a mecha- 
nism for building graphs with existing stream APIs. We take a code 
that launches work into a stream and bracket it with cudaStreamBe- 
ginCapture() and cudaStreamEndCapture(). 


/* Begin stream capture. */ 
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cudaStreamBeginCapture(stream_1, 
cudaStreamCaptureModeGlobal) ; 


/* Fork into two streams. */ 
cudaEventRecord(fork_event, stream_1); 
cudaStreamWaitEvent (stream_2, fork_event); 


/* Run two vec multiplication concurently in two 
different streams. */ 

vec_multiply<T> <<<grid_dim_mult_1, block_size, 
0, stream_1 >>> (params.vec_a, params.vec_b, 
params.vec_c, params.size_c); 

vec_multiply<T> <<<grid_dim_mult_2, block_size, 
0, stream_2 >>> (params.vec_d, params.vec_e, 
params.vec f, params.size f); 


/* Run two vec sum reductions concurently in two 
different streams. */ 
sum reduction shfl«T» «««grid dim red 1, 
block size, shared mem size, stream 1 >>> ( 
params.sum c, params.vec c, *params.size c); 
sum reduction shfl«T»^ «««grid dim red 2, 
block size, shared mem size, stream 2 >>> ( 
params.sum f, params.vec f, *params.size f); 


/* Rejoin streams. */ 
cudaEventRecord(rejoin event, stream 2); 
cudaS$treamWaitEvent(stream 1, rejoin event); 


/* Record host function call. */ 
cudaLaunchHostFunc(stream 1, (cudaHostFn t)add«T 
>, (void*)&params); 


/* End the capture */ 
cudaStreamEndCapture(stream 1, &graph); 


A call to cudaStreamBeginCapture() places a stream into capture 
mode. When a stream is being captured, work launched into the stream 
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is not enqueued for execution. Instead, it is appended to an internal 
graph that is progressively being built up. The graph is then returned 
by calling cudaStreamEndCapture(), which ends capture mode for 
the stream.[29] As the code above illustrates, when a stream waits for 
an event that was recorded in some other stream that is being captured, 
it also puts the new stream into capture mode. The two streams are 
then both captured to the same graph. 

After we build the graph, we need to initiate it. Initiation takes a 
snapshot of the graph template, validates it, and performs much of 
the setup and initialization work to minimize what needs to be done 
at launch. The resulting graph may be launched into a stream, similar 
to any other CUDA work, any number of times without repeating the 
initiation.[29 | 
cudaGraphExec_t graph_instance; 
cudaGraphInstantiate(&£graph instance, graph 1, 

nüllptr, nullptr, 0); 


for(size t i = 0; i € 10; ++i) 
cudaGraphLaunch(graph instance, stream); 


6.4 CUDA 11 


The GPUs of Ampere microarchitecture presented many new hard- 
ware capabilities both on the side of consumer GPUs and data center 
accelerators. The Ampere GPUs come with a new Tensor Cores gener- 
ation and provide L2 cache residency control. The A100 GPU added 
support for Multi-Instance GPU partitioning capability. The CUDA 11 
enables programmers to leverage all these new hardware features. It 
also presents some performance optimizations in CUDA libraries and 
updates to CUDA graphs and cooperative groups.| 42] 


6.4.1 Memory Management 


One of the optimization strategies to maximize the performance of a 
GPU kernel is to minimize data transfers. If the memory is resident in 
global memory, the latency of reading data into the L2 cache or into 
shared memory might take several hundred processor cycles. CUDA 
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11 (on supported GPUs) offers API operations to set aside a portion 
of the L2 cache to persist data accesses to global memory. Persisting 
accesses have priority use of this set-aside portion of an L2 cache. L2 
persistence can be set for use in CUDA stream or CUDA graph kernel 
node. The part of an L2 cache can be set aside for persisting access with 
cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, size) function 
call. And then with setting an access policy in a stream attribute to 
cudaAccessPropertyPersisting.[42] 

CUDA 11 lets programmers take advantage of a new asynchronous 
copy paradigm. It essentially overlaps copying data from global to 
shared memory with computation and avoids using intermediate 
registers or the L1 cache. Async-copy has several benefits: control 
flow no longer traverses the memory pipeline twice, and not using 
intermediate registers can reduce register pressure, increasing kernel 
occupancy.|42] 


6.4.2 Cooperative Group Enhancements 


The CUDA 11 cooperative groups add two new group portioning 
methods. First is the labeled partition, a collective operation that parti- 
tions the parent group into one-dimensional subgroups within which 
threads are coalesced; coalesced threads are threads that remain active 
on the path in the data-dependent condition branch. The implemen- 
tation will evaluate a conditional label (given as an argument) and 
assign threads with the same value for the label into the same group. 
The second partitioning method, binary partition, is similar to labeled 
partition, where the label can only be evaluated to 0 or 1.[29] 

The CUDA 11 also adds several new group collectives for data 
transfer and manipulation. A group-wide collective memory copy 
memcpy_async() that utilizes hardware-accelerated support for non- 
blocking memory transactions from global to shared memory. The 
wait() collective that synchronizes the named group of threads and 
blocks until all unfinished memcpy. async requests are completed. 
The new data manipulation collectives include reduce() and inclu- 
sive_scan() /exclusive_scan(). Both reduce and scan perform a re- 
duction/scan operation on the data provided by each thread of the 
group.[29] 
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OpenCL 3.0 realigns the OpenCL roadmap to enable developer-requested 
functionality to be broadly deployed by hardware vendors. It increases 
deployment flexibility by empowering conformant OpenCL implemen- 
tations to focus on functionality relevant to their target market.[ 43] 
This is achieved by slicing all functionality beyond OpenCL 1.2 into 
optional features that can be queried in the API, with macros to indi- 
cate whether optional OpenCL C language features are present.[ 44] 
With the release of the R465 display driver, Nvidia added support for 
OpenCL 3.0 on Maxwell and later GPUs on both Windows and Linux. 
The existing OpenCL 1.x based applications continue to work with 
NVIDIA's OpenCL 3.0 drivers without any changes. In addition to full 
OpenCL 1.2 compatibility, NVIDIA's OpenCL 3.0 drivers now deliver 
some optional OpenCL 3.0 functionality, and some experimental 2.0 
features. [43, 45] 


7.4 Supported optional OpenCL 3.0 features 


In this chapter we describe some of the major optional OpenCL 3.0 
features supported by NVIDIA GPUs. 


7.1.1 RGBA vector component naming 


The vector data types can be addressed with vec.rgba syntax. In the 
example below, we illustrate accessing of components in vectors: 


int4 vec a = (int4)(0); // vec a = (0, 0, O, 0) 
vec a.r = 1; Z7 wec wu = (1, 0, 0, 0) 
vec a.ga = (int2)(2, 4); // vec a = (1, 2, 0, 4) 
vec a.b = 3; // vec.a = (1, 2, 3, 4) 


float3 vec b; 
vec b.rpbo (floats) (1.1%, 2.2f, 3.3f); // wees 
= (Ifa S.S]. gof? 
vec b.a = 4.4f; // illegal, vec b does mot have 
4th element 
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7.1.2 Copying Kernel Objects 


Cloning is used to make a shallow copy of the kernel object, its ar- 
gument, and kernel execution information. The cloning is done with 
clCloneKernel() function. The returned object is an exact copy of the 
source kernel, with one caveat: the reference count on the returned ker- 
nel object is set as if it had been returned by clCreateKernel() function, 
and the reference count of the source kernel will not change.[33 


7.1.3 Shared virtual memory 


OpenCL extends the global memory region into the host memory 
region through an SVM (Shared Virtual Memory) mechanism. There 
are several types of SVM in OpenCL. NVIDIA adds support for coarse- 
grained buffer SVM, where sharing occurs at the granularity of regions 
of OpenCL buffer memory object. Consistency is enforced at synchro- 
nization points and with map/unmap commands to drive updates be- 
tween the host and the device. This form of SVM is similar to non-SVM 
use of memory; however, it lets kernel-instance share pointer-based 
data structures (such as linked lists) with the host program.[33, 45] 

NVIDIA also adds support clEnqueueSVMMigrateMem() as one 
of optional features from OpenCL 2.1. It enqueues a command to 
indicate which device a set of ranges of SVM allocations should be 
associated with; the SVM allocation will be migrated to the device as- 
sociated with the command queue given as an argument. It also offers 
an option to migrate a set of SVM allocations to host memory.[33 | 

In the example below, we allocate two SVM buffers representing 
vectors. We map the vec a into host memory for initialization and 
unmap it afterward. 


/* Create shared virtual memory buffers. */ 


int* vec a = (int*)c1SVMAlloc(curr context, 
CL MEM READ WRITE, vec size mem, sizeof(int)); 
int* vec b = (int*)c1SVMAlloc(curr context, 


CL MEM READ WRITE, vec size mem, sizeof(int)); 


/* Map vec a into host memory for modification. 


ay 
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clEnqueueSVMMap(cq_1, CL_BLOCKING, CL_MAP_WRITE, 
(void*)vec_a, vec_size_mem, 0, NULL, NULL); 


init_vec(vec_a, vec_size); 


/* Unmap vec_a from host memory. */ 
clEnqueueSVMUnmap(cq_1, (void*)vec_a, 0, NULL, 
NULL) ; 


We can set SVM buffers as kernel arguments with a clSetKer- 
nelArgSVMPointer() API call: 


clSetKernelArgSVMPointer(vec_copy_kernel, 0, 
vec_a); 

clSetKernelArgSVMPointer(vec copy kernel, 1, 
vec b); 


After the kernel finishes execution, we migrate the vectors into 
host memory. We can use the same pointers to access them: 


/* Migrate vector from device memory to host 
memory. */ 


const int n_of_ptrs = 2; 

const void* svm ptrs[n of ptrs] = ívec a, vec b 
}; 

const size_t sizes[n_of_ptrs] = { vec_a_size_mem 


, vec_b_size_mem }; 

ret = clEnqueueSVMMigrateMem(cq 1, n of ptrs, 
svm ptrs, sizes, CL MIGRATE MEM OBJECT HOST, 
0, NULL, NULL); 


int val = vec a[0]; 


7.2 Experimentally supported OpenCL 2.0 features 


7.2.1 Device Side Enqueues 


Algorithms may need to generate additional work as they execute. 
In many cases, this additional work cannot be determined statically, 
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so the work associated with a kernel only emerges at runtime as the 
kernel instance executes. OpenCL offers a device-side kernel enqueue 
command to enable nested parallelism. The kernel executing on a 
device enqueues a kernel instance to a device-side command queue, 
an out-of-order command queue, which follows the same behavior as 
the out-of-order command queues exposed to the host program.[33] 

In the example below, we illustrate how to enqueue ker_B kernel 


in ker_A kernel: 


__kernel void ker_B(__global int* a, unsigned 


int n) 
{ 
} 
__kernel void ker_A(__global int* a, __ global 
int* b, unsigned int n) 
{ 
ndrange t ndrange = ndrange 1D(...); 
enqueue kernel(get default queue(), 
CLK ENQUEUE FLAGS WAIT KERNEL, ndrange, 
ker B(a, n); +); 
} 


“A 
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Conclusion 


In this work, we first provided a summarized description of the in- 
novations, features, and properties of NVIDIA GPUs of Pascal, Volta, 
Turing, and Ampere microarchitecture. We described some of their 
possible use cases, limitations, or impact on performance and pro- 
grammability. Then we introduced motivation, basic concepts, and 
possible benefits of general-purpose GPU programming. We covered 
all the basic principles needed to understand and write simple pro- 
grams for GPU in CUDA and OpenCL, like programming model and 
hardware implementation. Moreover, we also showed and explained 
how to efficiently use, transfer, and access data in GPU memory spaces. 
Then we described features added in newer CUDA and OpenCL ver- 
sions and showed how these new features can be used and what 
advantages or new possibilities they present. Lastly, we provided code 
examples that demonstrate the basics as well as the use of more ad- 
vanced features on some practical problems, like matrix multiplication 
or vector sum reduction. 
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